Relational datasets are being generated at an alarmingly rapid rate acrossorganizations and industries. Compressing these datasets could significantlyreduce storage and archival costs. Traditional compression algorithms, e.g.,gzip, are suboptimal for compressing relational datasets since they ignore thetable structure and relationships between attributes. We study compression algorithms that leverage the relational structure tocompress datasets to a much greater extent. We develop Squish, a system thatuses a combination of Bayesian Networks and Arithmetic Coding to capturemultiple kinds of dependencies among attributes and achieve near-entropycompression rate. Squish also supports user-defined attributes: users caninstantiate new data types by simply implementing five functions for a newclass interface. We prove the asymptotic optimality of our compressionalgorithm and conduct experiments to show the effectiveness of our system:Squish achieves a reduction of over 50\% in storage size relative to systemsdeveloped in prior work on a variety of real datasets.
展开▼